Convergence of Markov Chain

Statistics

By yuvrajiro

What is a Markov Chain?

A Markov chain is a stochastic model in which the future depends only on the present, not on the past. Formally,

$$P(X^{t+1} \mid X^t, X^{t-1}, \ldots, X^2, X^1) = P(X^{t+1} \mid X^t)$$

Transition Probability Matrix

Let us denote

$$p_{ij} = P(X^{n+1} = i \mid X^n = j)$$

where $p_{ij}$ denotes the probability of going from state "j" to state "i" in one step. Similarly, we can define $p_{ij}^n$ as the probability of going from state "j" to state "i" in $n$ steps. We can then write the transition probability matrix as

$$TPM = \begin{bmatrix} p_{11} & p_{12} & p_{13} & \cdots \\ p_{21} & p_{22} & p_{23} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}$$
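The $n$-step transition probabilities are simply entries of the $n$-th power of this matrix. A minimal sketch in Python; the two-state chain and its numbers are made up for illustration, with columns as the "from" states following the convention above:

```python
import numpy as np

# Column-stochastic TPM following the convention in the text:
# entry p_ij is the probability of moving from state j to state i,
# so each column sums to 1.  The numbers are illustrative.
P = np.array([[0.9, 0.2],
              [0.1, 0.8]])

# n-step transition probabilities are the entries of the matrix power P^n
P5 = np.linalg.matrix_power(P, 5)

print(P5[:, 0])        # distribution after 5 steps when starting in state 0
print(P5.sum(axis=0))  # columns of P^5 still sum to 1
```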

Why is Convergence of Markov Chains Important?

Revisit MCMC

MCMC has vast usage in statistics, mathematics, and computer science. Here we will discuss a simple problem in Bayesian computation and assess why convergence of the Markov chain is important.

Let us assume that we want to estimate a certain parameter $t(\theta)$, where the model is given such that $g(\theta)$ is the prior density for $\theta$ and $f(y \mid \theta)$ is the likelihood of $y = y_1, y_2, \ldots, y_n$ given the value of $\theta$. Then the posterior can be written as

$$g(\theta \mid y) \propto g(\theta) f(y \mid \theta)$$

which has to be normalized; the posterior density is then given by

$$g(\theta \mid y) = \frac{g(\theta) f(y \mid \theta)}{\int g(\theta) f(y \mid \theta)\, d\theta}$$

For the sake of simplicity, let us assume $t(\theta) = \theta$, and let $\hat{\theta}$ be an estimate. Take the squared error loss function

$$L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2$$

Then the classical risk is given by $R_{\hat{\theta}}(\theta) = E_{\theta}(L(\theta, \hat{\theta}))$ and the Bayes risk is given by

$$r(\hat{\theta}) = \int R_{\hat{\theta}}(\theta) g(\theta)\, d\theta$$

Now our target is to minimize the Bayes risk to get the Bayes estimate:

$$\begin{align*} r(\hat{\theta}) &= \int R_{\hat{\theta}}(\theta)g(\theta)\,d\theta \\ &= \int E_{\theta}(L(\theta,\hat{\theta}))\, g(\theta)\,d\theta \\ &= \int \left( \int (\theta - \hat{\theta})^2 f(y\mid\theta)\,dy\right)g(\theta)\,d\theta \\ &= \int\int (\theta - \hat{\theta})^2 f(y\mid\theta)\,g(\theta)\,dy\,d\theta \\ &= \int \left( \int (\theta - \hat{\theta})^2 g(\theta\mid y)\,d\theta\right)f(y)\,dy \tag{1} \end{align*}$$

Equation $(1)$ is minimized when the inner integral is minimized, which happens when

$$\hat{\theta} = E(\theta \mid y)$$

Now, we may not always be able to calculate the mean of the posterior density analytically, that is,

$$\hat{\theta} = \int \theta\, g(\theta \mid y)\, d\theta$$

When we do not know the normalizing constant of the kernel and the integral is complex, we use the law of large numbers to estimate $\theta$: we draw random samples from the posterior kernel $g(\theta \mid y)$ and compute the mean of the samples. Mathematically,

$$X^1, X^2, \ldots, X^n \text{ are samples from } g(\theta \mid y), \text{ and } \frac{\sum_i X^i}{n} \to \hat{\theta} \ \text{ as } n \to \infty$$
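This sample-mean estimator is easy to see numerically. A minimal sketch, assuming for illustration that the posterior $g(\theta \mid y)$ happens to be a $N(2, 1)$ density we can sample directly (in MCMC we would instead sample it with a Markov chain):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior: N(2, 1).  Averaging draws from it estimates the
# posterior mean, i.e. the Bayes estimate under squared error loss.
samples = rng.normal(loc=2.0, scale=1.0, size=100_000)
theta_hat = samples.mean()
print(theta_hat)  # close to the true posterior mean 2
```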

When Does a Markov Chain Converge?

Now let us write $g(\theta \mid y) = \pi(\theta)$; this is justified because $y$ is realized, so $g(\theta \mid y)$ is a function of $\theta$ alone. Here MCMC comes in: if we can create a chain whose stationary distribution is $\pi(\theta)$, then we can treat that chain as random samples converging to $\pi(\theta)$, and that is the reason we need the Markov chain to converge. Before we move forward, let us state some definitions.

Let $\pi$ denote a probability measure on $(\mathcal{X}, \mathcal{B})$ and let $\Phi = \{X^0, X^1, \ldots\}$ be a discrete-time Markov chain on $(\mathcal{X}, \mathcal{B})$. Let $P$ be the transition kernel and $k$ the transition density, so that

$$P(x, A) = \Pr(X^{i+1} \in A \mid X^i = x) = \int_A k(x, y)\, dy$$

That is, $P(x, A)$ gives the probability of a one-step transition from state $x$ to any state in $A$. The transition kernel induces two linear operators:

  1. $\lambda P$, where $\lambda$ is a probability distribution on $(\mathcal{X}, \mathcal{B})$
  2. $Pf$, where $f$ is a non-negative measurable function on $(\mathcal{X}, \mathcal{B})$

where

$$\lambda P(A) = \int_{\mathcal{X}} \lambda(x) P(x, A)\, dx$$

So if $X^i \sim \lambda$, then $\lambda P(A)$ is the marginal probability that $X^{i+1} \in A$, and

$$Pf(x) = \int_{\mathcal{X}} P(x, dy) f(y) = E_P[f(X^{i+1}) \mid X^i = x]$$

and the $m$-step transition probability is given by

$$P^m(x, A) = \int_A k^m(x, y)\, dy$$

Invariant Density $\pi$

$$\pi = \pi P \quad \Rightarrow \quad \pi(x) = \int_{\mathcal{X}} \pi(y) k(y, x)\, dy$$

Now there are several ways to ensure $\pi$ is the invariant (or stationary) distribution; one way is to satisfy the detailed balance condition, i.e.

$$\pi(x) k(x, y) = \pi(y) k(y, x) \qquad \text{for all } x, y \in \mathcal{X}$$

Proof

Suppose $\pi$ satisfies the detailed balance condition. Then

$$\int_{\mathcal{X}} \pi(y) k(y, x)\, dy = \int_{\mathcal{X}} \pi(x) k(x, y)\, dy = \pi(x)$$

However, the detailed balance condition is sufficient but not necessary; reversibility is not required for $\pi$ to be invariant. Suppose $X^i \sim \pi$ and the chain preserves this distribution over any number of transitions; then we say the Markov chain is stationary, and hence it converges to $\pi$, which is what MCMC requires.
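The implication "detailed balance implies invariance" can be checked numerically on a small chain. A sketch with a made-up reversible birth-death chain on three states (row convention here: `P[i, j]` is the probability of moving from state `i` to state `j`):

```python
import numpy as np

# A birth-death chain on {0, 1, 2} with made-up probabilities.
# Row convention: P[i, j] = Pr(next = j | current = i).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])

# detailed balance: pi_i P_ij = pi_j P_ji for all i, j
balance = pi[:, None] * P
print(np.allclose(balance, balance.T))  # True: the chain is reversible

# invariance then follows: pi P = pi
print(np.allclose(pi @ P, pi))          # True
```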

Let us Define

**$\phi$-irreducible**: A Markov chain is $\phi$-irreducible for some measure $\phi$ on $(\mathcal{X}, \mathcal{B})$ if for all $x \in \mathcal{X}$ and $A \in \mathcal{B}$ with $\phi(A) > 0$, there exists $n$ for which $P^n(x, A) > 0$.

**Aperiodic**: A chain is aperiodic if its period is 1.

**Harris Recurrent**: A $\phi$-irreducible Markov chain is Harris recurrent if, for every set $A$ with $\phi(A) > 0$, the chain reaches $A$ with probability 1.

**Harris Ergodic**: A Markov chain is said to be Harris ergodic if it is $\phi$-irreducible, aperiodic, Harris recurrent, and possesses an invariant distribution $\pi$.

**Total Variation Distance**: The total variation distance between two measures $\mu(\cdot)$ and $\nu(\cdot)$ is defined by

$$\| \mu(\cdot) - \nu(\cdot) \| = \sup_{A \in \mathcal{B}} |\mu(A) - \nu(A)|$$
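For discrete measures, the supremum is attained at the set $A = \{i : \mu_i > \nu_i\}$, giving half the $L_1$ distance between the probability vectors. A quick sketch with made-up vectors:

```python
import numpy as np

# Two made-up probability vectors on three points
mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])

# sup over A is attained at A = {i : mu_i > nu_i}, giving 0.5 * L1 distance
tv = 0.5 * np.abs(mu - nu).sum()
print(tv)  # 0.3
```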

What does Harris Ergodicity Guarantee?

  • The chain is guaranteed to explore the entire space without getting stuck
  • Strong consistency of Markov chain averages
  • Convergence of the Markov chain to stationarity in total variation distance

The following two theorems are very important for MCMC

**Ergodic Theorem**: Suppose a Markov chain $\Phi$ is Harris ergodic with invariant distribution $\pi$ and $E_{\pi}|g(X)| < \infty$ for some function $g: \mathcal{X} \to \Bbb{R}$. Then for any starting value $x \in \mathcal{X}$,

$$\bar{g}_n = \frac{1}{n}\sum_{i=0}^{n-1} g(X^i) \to E_{\pi} g(X) \quad \text{almost surely as } n \to \infty$$

and that is the main guarantee we rely on in MCMC.

Birkhoff, George D. “Proof of the Ergodic Theorem.” Proceedings of the National Academy of Sciences of the United States of America, vol. 17, no. 12, 1931, pp. 656–660. JSTOR, www.jstor.org/stable/86016. Accessed 9 Apr. 2021.
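The ergodic theorem is visible in simulation: a chain average of indicator functions converges to the stationary probabilities. A minimal sketch with a made-up two-state chain whose stationary distribution is $\pi = (2/3, 1/3)$ (row convention: `P[i, j]` is the probability of moving from state `i` to `j`):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-state chain with made-up numbers; its stationary distribution
# solves pi P = pi, giving pi = (2/3, 1/3).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
n = 100_000
u = rng.random(n)
x, visits = 0, np.zeros(2)
for i in range(n):
    visits[x] += 1
    x = 0 if u[i] < P[x, 0] else 1  # move according to the current row of P

gbar = visits / n  # chain averages of g(x) = indicator of each state
print(gbar)        # approaches (2/3, 1/3) by the ergodic theorem
```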

The other Theorem is as follows

Suppose a Markov chain $\Phi$ is Harris ergodic with invariant distribution $\pi$. Then for any starting value $x \in \mathcal{X}$, $\Phi$ converges to $\pi$ in total variation distance, i.e.

$$\|P^n(x, \cdot) - \pi(\cdot)\| \to 0 \ \text{ as } n \to \infty$$

Further, $\|P^n(x, \cdot) - \pi(\cdot)\|$ is monotonically non-increasing in $n$.

Rate of Convergence

The ergodic theorem tells us about convergence of the Markov chain; however, it says nothing about the rate of convergence. We call a Markov chain converging at a geometric rate geometrically ergodic, i.e. there exist $M: \mathcal{X} \to \Bbb{R}$ and some constant $t \in (0, 1)$ satisfying

$$\|P^n(x, \cdot) - \pi\| \leq M(x)\, t^n \qquad \text{for any } x \in \mathcal{X}$$

If $M$ is bounded, the Markov chain is uniformly ergodic.

  • As long as the starting value $x$ is such that $M(x)$ is not large, geometric ergodicity guarantees quick convergence of the Markov chain
  • Geometric ergodicity holds for every irreducible and aperiodic Markov chain on a finite state space
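For a finite chain the geometric rate is visible directly: the total variation distance shrinks by a constant factor each step, that factor being the modulus of the second-largest eigenvalue of the transition matrix. A sketch with a made-up two-state chain (row convention: `P[i, j]` is the probability of moving from `i` to `j`), whose second eigenvalue is $0.7$:

```python
import numpy as np

# Made-up two-state chain with stationary distribution pi = (2/3, 1/3)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2/3, 1/3])

# total variation distance from starting state 0 after n steps;
# for discrete chains ||P^n(0,.) - pi|| = 0.5 * sum_i |P^n(0, i) - pi_i|
tv = []
Pn = np.eye(2)
for n in range(1, 11):
    Pn = Pn @ P
    tv.append(0.5 * np.abs(Pn[0] - pi).sum())

# the distance decays geometrically at rate t = 0.7, the second eigenvalue
ratios = [tv[i + 1] / tv[i] for i in range(len(tv) - 1)]
print(ratios)
```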

What is Needed for Geometric Ergodicity?

Drift and Minorization Conditions

A Type 1 drift condition holds if there exist some non-negative function $V: \mathcal{X} \to \Bbb{R}_{\geq 0}$ and constants $0 < \gamma < 1$ and $L < \infty$ such that

$$PV(x) \leq \gamma V(x) + L \qquad \text{for any } x \in \mathcal{X}$$

Further, we call $V$ a drift function and $\gamma$ a drift rate.

A minorization condition holds on a set $C \in \mathcal{B}$ if there exist some positive integer $m$, $\epsilon > 0$, and probability measure $Q$ on $(\mathcal{X}, \mathcal{B})$ for which

$$P^m(x, A) \geq \epsilon Q(A) \qquad \text{for all } x \in C$$

We also call this an $m$-step minorization condition; here $C$ is called small. In terms of densities it implies

$$k^m(x, y) \geq \epsilon q(y)$$

Proposition

Suppose a Markov chain $\Phi$ is irreducible and aperiodic with invariant distribution $\pi$. Then $\Phi$ is geometrically ergodic if the following two conditions are met:

  1. The Type I drift condition holds
  2. There exists some constant $d > 2L/(1-\gamma)$ for which the one-step minorization condition holds on the set $C = \{x : V(x) \leq d\}$

This proposition is a corollary of Rosenthal (1995a):

Let $\Phi$ be an aperiodic and irreducible Markov chain with invariant distribution $\pi$.

Suppose conditions 1 and 2 of the proposition hold, let $X^0 = x_0$ be the starting value, and define

$$\alpha = \frac{1+d}{1+2L+\gamma d} \qquad \text{and} \qquad U = 1 + 2(\gamma d + L)$$

Then for any $r \in (0, 1)$,

$$\|P^n(x_0, \cdot) - \pi(\cdot)\| \leq (1-\epsilon)^{rn} + \left(\frac{U^r}{\alpha^{1-r}}\right)^n \left(1 + \frac{L}{1-\gamma} + V(x_0)\right)$$

We can rearrange this to see that it satisfies the geometric ergodicity condition.

  1. $V(x) + 1$ is proportional to $M(x)$; hence the starting point should minimize $V(x)$

**Type II Drift Condition**: If there exist some function $W: \mathcal{X} \to [1, \infty)$ finite at some $x \in \mathcal{X}$, some set $D \in \mathcal{B}$, and constants $0 < \rho < 1$ and $b < \infty$ for which

$$PW(x) \leq \rho W(x) + b I_D(x) \qquad \text{for all } x \in \mathcal{X}$$

It is easy to show that the Type I drift condition $\Leftrightarrow$ Type II drift condition.

Finally we can say that

Suppose a Markov chain $\Phi$ is aperiodic and $\phi$-irreducible with invariant distribution $\pi$. Then $\Phi$ is geometrically ergodic if there exist some small set $D$, a drift function $W: \mathcal{X} \to [1, \infty)$, and some constants $0 < \rho < 1$ and $b < \infty$ for which a Type II drift condition holds.

Now let me restate the earlier theorem:

Suppose a Markov chain $\Phi$ is Harris ergodic with invariant distribution $\pi$. Then for any starting value $x \in \mathcal{X}$, $\Phi$ converges to $\pi$ in total variation distance, i.e.

$$\|P^n(x, \cdot) - \pi(\cdot)\| \to 0 \ \text{ as } n \to \infty$$

Further, $\|P^n(x, \cdot) - \pi(\cdot)\|$ is monotonically non-increasing in $n$.

Jain and Jamison (1967) showed that for every $\phi$-irreducible Markov chain on $(\mathcal{X}, \mathcal{B})$ there exists some small set $C \in \mathcal{B}$ for which $\phi(C) > 0$. Furthermore, the corresponding minorization measure $Q(\cdot)$ can be defined so that $Q(C) > 0$.

The Jain and Jamison result allows us to assume a $C \in \mathcal{B}$ such that

$$P(x, A) \geq \epsilon Q(A) \qquad \text{for all } x \in C$$

That is a one-step minorization condition. Now we can write

$$P(x, A) = \epsilon Q(A) + (1-\epsilon) R(x, A) \qquad \text{for all } x \in C \text{ and } A \in \mathcal{B}$$

Here $R(x, \cdot)$ is a probability measure on $(\mathcal{X}, \mathcal{B})$. This allows us to construct two separate chains which couple with probability 1:

$$\Phi(X) = \{X^0, X^1, \ldots\} \\ \Phi(Y) = \{Y^0, Y^1, \ldots\}$$

Now $(X^n, Y^n) \to (X^{n+1}, Y^{n+1})$ with the following algorithm:

  1. While $X^n \neq Y^n$:
    1. If $(X^n, Y^n) \notin C \times C$, draw $X^{n+1} \sim P(X^n, \cdot)$ and $Y^{n+1} \sim P(Y^n, \cdot)$ independently
    2. If $(X^n, Y^n) \in C \times C$:
      1. Draw $\delta_n \sim \mathrm{Bern}(\epsilon)$
      2. If $\delta_n = 0$, draw $X^{n+1} \sim R(X^n, \cdot)$ and $Y^{n+1} \sim R(Y^n, \cdot)$ independently
      3. Otherwise, draw $X^{n+1} = Y^{n+1} \sim Q(\cdot)$
  2. Once $X^n = x = Y^n$, draw $X^{n+1} = Y^{n+1} \sim P(x, \cdot)$

Now define the coupling time $T$ as the first $n$ for which $(X^{n-1}, Y^{n-1}) \in C \times C$ and $\delta_{n-1} = 1$. Once the chains couple, they remain equal.

Now let us assume

$$X^0 = x \quad \text{and} \quad Y^0 \sim \pi$$

Let $\Pr_x$ denote probability with respect to the starting point $x$, and note that $\Phi(Y)$ is stationary. Then

$$\begin{align*} |P^n(x,A) - \pi(A)| &= |\Pr_x(X^n \in A) - \Pr_x(Y^n \in A)| \\ &= |\Pr_x(X^n \in A, X^n = Y^n) + \Pr_x(X^n \in A, X^n \neq Y^n) - \Pr_x(Y^n \in A, X^n \neq Y^n) - \Pr_x(Y^n \in A, X^n = Y^n)| \\ &= |\Pr_x(X^n \in A, X^n \neq Y^n) - \Pr_x(Y^n \in A, X^n \neq Y^n)| \\ &\leq \max\{\Pr_x(X^n \in A, X^n \neq Y^n),\, \Pr_x(Y^n \in A, X^n \neq Y^n)\} \\ &\leq \Pr_x(X^n \neq Y^n) \\ &= \Pr_x(T > n) \end{align*}$$

Thus

$$\|P^n(x, \cdot) - \pi(\cdot)\| \leq \Pr_x(T > n)$$

Now suppose the minorization condition holds over the entire space, i.e. $C = \mathcal{X}$. In this case every pair $(X^n, Y^n)$ belongs to $C \times C$ for all $n$, so

$$T \sim \mathrm{Geo}(\epsilon), \qquad \Pr(T > n) = (1-\epsilon)^n$$

so

$$\|P^n(x, \cdot) - \pi(\cdot)\| \leq (1-\epsilon)^n$$

So when $C = \mathcal{X}$, $\|P^n(x, \cdot) - \pi(\cdot)\| \to 0$ as $n \to \infty$.
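For a finite chain the whole-space minorization and the resulting $(1-\epsilon)^n$ bound are easy to exhibit: take $\epsilon = \sum_j \min_x P(x, j)$ and $Q(j) = \min_x P(x, j)/\epsilon$. A sketch with a made-up two-state chain (row convention: `P[i, j]` is the probability of moving from `i` to `j`):

```python
import numpy as np

# Made-up two-state chain
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

m = P.min(axis=0)  # elementwise minimum over starting states
eps = m.sum()      # epsilon = 0.3 for this chain
Q = m / eps        # the minorization measure

# one-step minorization on the whole space: P(x, A) >= eps * Q(A) for all x, A
print(np.all(P >= eps * Q))  # True

# the coupling argument then gives ||P^n(x,.) - pi|| <= (1 - eps)^n
print(eps, Q)
```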

And when $C \neq \mathcal{X}$, the distribution of $\Pr(T > n)$ is complicated and beyond the scope of this presentation.

Deterministic Update Gibbs Sampler (DUGS)

Let us assume our target distribution is $\pi(\theta)$ with $\theta = (\theta_1, \theta_2, \ldots, \theta_d)$.

Notation: $\theta_{-i}$ is the vector of parameters except $\theta_i$.

Initialization: $\theta^0 = (\theta_1^0, \theta_2^0, \ldots, \theta_d^0)$

Iteration: for $i \geq 1$:

  • Sample $\theta_1^i \sim \pi(\theta_1 \mid \theta^{i-1}_{-1})$
  • Sample $\theta_2^i \sim \pi(\theta_2 \mid \theta_1^i, \theta^{i-1}_{-(1,2)})$
  • ...
  • Sample $\theta_d^i \sim \pi(\theta_d \mid \theta^{i}_{-d})$

The transition kernel for two parameters is given by

$$k((\theta_1, \theta_2), (\tilde{\theta}_1, \tilde{\theta}_2)) = \pi(\tilde{\theta}_1 \mid \theta_2)\cdot \pi(\tilde{\theta}_2 \mid \tilde{\theta}_1)$$

Let us check stationarity for two parameters:

$$\begin{align*} \int\int \pi(\theta_1,\theta_2)\, k((\theta_1,\theta_2),(\tilde{\theta}_1,\tilde{\theta}_2))\, d\theta_1 d\theta_2 &= \int\int \pi(\theta_1,\theta_2)\, \pi(\tilde{\theta}_1\mid\theta_2)\cdot \pi(\tilde{\theta}_2\mid\tilde{\theta}_1)\, d\theta_1 d\theta_2 \\ &= \int \pi(\theta_2)\, \pi(\tilde{\theta}_1\mid\theta_2)\cdot \pi(\tilde{\theta}_2\mid\tilde{\theta}_1)\, d\theta_2 \\ &= \int \pi(\tilde{\theta}_1,\theta_2)\cdot \pi(\tilde{\theta}_2\mid\tilde{\theta}_1)\, d\theta_2 \\ &= \pi(\tilde{\theta}_1)\cdot \pi(\tilde{\theta}_2\mid\tilde{\theta}_1) \\ &= \pi(\tilde{\theta}_1,\tilde{\theta}_2) \end{align*}$$
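This invariance can be checked by simulation. A sketch with a standard illustration not taken from the post: a bivariate normal target with correlation $\rho$, whose full conditionals are $\theta_1 \mid \theta_2 \sim N(\rho\theta_2, 1-\rho^2)$ and symmetrically for $\theta_2$. Starting a large sample in the target and applying one full Gibbs sweep leaves its distribution unchanged:

```python
import numpy as np

rng = np.random.default_rng(7)

# Bivariate normal target with correlation rho (illustrative numbers)
rho, n = 0.5, 200_000
cov = np.array([[1.0, rho], [rho, 1.0]])

# start *in* the target, then apply one full deterministic-update sweep:
# draw theta1 | theta2, then theta2 | theta1, using the exact conditionals
th = rng.multivariate_normal([0.0, 0.0], cov, size=n)
th1 = rng.normal(rho * th[:, 1], np.sqrt(1 - rho ** 2))
th2 = rng.normal(rho * th1, np.sqrt(1 - rho ** 2))
out = np.column_stack([th1, th2])

print(np.cov(out.T))  # still close to cov: pi is invariant under the sweep
```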

However, this alone does not suffice for convergence: aperiodicity is needed to ensure the samples are not cycling, and irreducibility confirms the chain will not get stuck, so the whole space is explored. If we can prove the detailed balance condition, we are assured the chain will converge. Let $\Phi_1 = \{\theta_1^0, \theta_1^1, \ldots\}$ be the marginal chain of the first coordinate and let $k_1$ be its transition density, $k_1(\theta_1, \tilde{\theta}_1) = \int \pi(\theta_2 \mid \theta_1)\, \pi(\tilde{\theta}_1 \mid \theta_2)\, d\theta_2$. Then

$$\begin{align*} \pi(\theta_1)\, k_1(\theta_1, \tilde{\theta}_1) &= \pi(\theta_1) \int \pi(\theta_2 \mid \theta_1)\, \pi(\tilde{\theta}_1 \mid \theta_2)\, d\theta_2 \\ &= \int \pi(\theta_1 \mid \theta_2)\, \pi(\theta_2)\, \pi(\tilde{\theta}_1 \mid \theta_2)\, d\theta_2 \\ &= \int \pi(\theta_1 \mid \theta_2)\, \pi(\tilde{\theta}_1, \theta_2)\, d\theta_2 \\ &= \pi(\tilde{\theta}_1) \int \pi(\theta_2 \mid \tilde{\theta}_1)\, \pi(\theta_1 \mid \theta_2)\, d\theta_2 \\ &= \pi(\tilde{\theta}_1)\, k_1(\tilde{\theta}_1, \theta_1) \end{align*}$$

Example

Let us suppose

$$Y_1, Y_2, \ldots, Y_m \overset{iid}{\sim} N(\mu, \theta)$$

where $m \geq 5$. Let us take the joint prior density as

$$g(\mu, \theta) \propto \frac{1}{\sqrt{\theta}}$$

Let $y = (y_1, y_2, \ldots, y_m)$ be the sample data with mean $\bar{y}$ and sum of squares $s^2 = \sum (y_i - \bar{y})^2$. Then the posterior is given by

$$g(\mu, \theta \mid y) \propto \theta^{-\frac{m+1}{2}} \exp\left(-\frac{1}{2\theta} \sum_{j=1}^m (y_j - \mu)^2\right)$$

and

$$\theta \mid \mu, y \sim IG\left(\frac{m-1}{2}, \frac{s^2 + m(\mu - \bar{y})^2}{2}\right) \\ \mu \mid \theta, y \sim N\left(\bar{y}, \frac{\theta}{m}\right)$$

We know the Inverse Gamma distribution with parameters $(a, b)$ has kernel $x^{-(a+1)} e^{-b/x}$.

Let us use the DUGS sampler with the following update scheme:

$$(\theta', \mu') \to (\theta, \mu') \to (\theta, \mu)$$

so the kernel density will be given by

$$k((\mu', \theta'), (\mu, \theta)) = \pi(\theta \mid \mu', y)\, \pi(\mu \mid \theta, y)$$
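A minimal sketch of this DUGS sampler with synthetic data (all numbers made up for illustration). An $IG(a, b)$ draw is obtained as the reciprocal of a $\mathrm{Gamma}(a, \text{scale} = 1/b)$ draw:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data for illustration
m = 20
y = rng.normal(5.0, 2.0, size=m)
ybar, s2 = y.mean(), ((y - y.mean()) ** 2).sum()

# DUGS updates, following the full conditionals in the text:
#   theta | mu, y ~ IG((m-1)/2, (s^2 + m*(mu - ybar)^2)/2)
#   mu | theta, y ~ N(ybar, theta/m)
n_iter, burn = 20_000, 2_000
mu, draws = ybar, []
for i in range(n_iter):
    b = 0.5 * (s2 + m * (mu - ybar) ** 2)
    theta = 1.0 / rng.gamma((m - 1) / 2, 1.0 / b)  # inverse-gamma draw
    mu = rng.normal(ybar, np.sqrt(theta / m))
    if i >= burn:
        draws.append((mu, theta))

mus = np.array([d[0] for d in draws])
print(mus.mean())  # the marginal posterior of mu is centered at ybar
```

Since the marginal posterior of $\mu$ is symmetric about $\bar{y}$, the average of the $\mu$ draws should sit close to $\bar{y}$ once the chain has converged.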

Type 1 Drift Condition

Let us define $V(\mu, \theta) = (\mu - \bar{y})^2$. Then

$$E[V(\mu, \theta) \mid \mu', \theta'] = E[V(\mu, \theta) \mid \mu'] = E\big[\,E[V(\mu, \theta) \mid \theta] \,\big|\, \mu'\,\big]$$

where

$$E[V(\mu, \theta) \mid \theta] = E[(\mu - \bar{y})^2 \mid \theta] = \mathrm{Var}[\mu \mid \theta] = \frac{\theta}{m}$$

Then

$$\begin{align*} E[V(\mu, \theta) \mid \mu', \theta'] &= E\left[\frac{\theta}{m} \,\Big|\, \mu'\right] = \frac{1}{m} \cdot \frac{s^2 + m(\mu' - \bar{y})^2}{m-3} \\ &= \frac{(\mu' - \bar{y})^2}{m-3} + \frac{s^2}{m(m-3)} \\ &= \frac{1}{m-3}\, V(\mu', \theta') + \frac{s^2}{m(m-3)} \end{align*}$$

Now $m \geq 5$ guarantees that $\frac{1}{m-3} < 1$, hence

$$PV(\mu', \theta') = E[V(\mu, \theta) \mid \mu', \theta'] \leq \frac{1}{m-3}\, V(\mu', \theta') + \frac{s^2}{m(m-3)}$$

So it satisfies the drift condition with $\gamma \in (1/(m-3), 1)$ and $L = s^2/(m(m-3))$.

Minorization Condition

Let $C = \{(\mu, \theta) : V(\mu, \theta) \leq d\}$ for $d \geq 2L/(1-\gamma)$. We need a density $q$ and $\epsilon > 0$ for which

$$k((\mu', \theta'), (\mu, \theta)) \geq \epsilon q(\mu, \theta) \qquad \text{for all } (\mu', \theta') \in C \text{ and } (\mu, \theta) \in \Bbb{R} \times \Bbb{R}_+$$

Note that

$$k((\mu', \theta'), (\mu, \theta)) = \pi(\mu \mid \theta, y)\, \pi(\theta \mid \mu', y) \geq \pi(\mu \mid \theta, y) \inf_{(\mu', \theta') \in C} \pi(\theta \mid \mu', y)$$

Let $IG(a, b; x)$ denote the Inverse Gamma density with parameters $(a, b)$ evaluated at $x > 0$. Then

$$g(\theta) = \inf_{(\mu', \theta') \in C} \pi(\theta \mid \mu', y) = \inf_{(\mu', \theta') \in C} IG\left(\frac{m-1}{2}, \frac{s^2}{2} + \frac{m}{2}(\mu' - \bar{y})^2; \theta\right) = \begin{cases} IG\left(\frac{m-1}{2}, \frac{s^2}{2} + \frac{md}{2}; \theta\right) & \text{if } \theta < \theta^* \\ IG\left(\frac{m-1}{2}, \frac{s^2}{2}; \theta\right) & \text{if } \theta \geq \theta^* \end{cases}$$

where $\theta^{*} = md\left[(m-1)\log(1 + md/s^2)\right]^{-1}$, so that

$$k((\mu', \theta'), (\mu, \theta)) \geq \pi(\mu \mid \theta, y)\, g(\theta) = \epsilon q(\mu, \theta)$$

where $q(\mu, \theta) = \epsilon^{-1} \pi(\mu \mid \theta, y)\, g(\theta)$ and $\epsilon = \int g(\theta)\, d\theta$ is the normalizing constant making $q$ a density.

Hence the minorization condition holds.
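The pieces of this argument can be verified numerically for made-up values of $m$, $s^2$, and $d$ (a sketch, not part of the original derivation): $\theta^*$ is exactly the crossing point of the two inverse-gamma densities, and $\epsilon = \int g(\theta)\, d\theta$ is strictly between 0 and 1:

```python
import numpy as np
from math import lgamma, log

# Made-up values for illustration
m, s2, d = 20, 80.0, 5.0
a = (m - 1) / 2
b0, b1 = s2 / 2, s2 / 2 + m * d / 2   # the two IG scale parameters

def ig_pdf(x, a, b):
    # Inverse Gamma density: b^a / Gamma(a) * x^{-(a+1)} * e^{-b/x}
    return np.exp(a * np.log(b) - lgamma(a) - (a + 1) * np.log(x) - b / x)

# theta* from the text is where the two densities cross
theta_star = m * d / ((m - 1) * log(1 + m * d / s2))
print(abs(ig_pdf(theta_star, a, b1) - ig_pdf(theta_star, a, b0)))  # ~ 0

# epsilon is the total mass of g(theta) = pointwise minimum of the densities
x = np.linspace(1e-3, 60.0, 400_001)
g = np.minimum(ig_pdf(x, a, b1), ig_pdf(x, a, b0))
eps = float(np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(x)))  # trapezoid rule
print(eps)  # strictly between 0 and 1
```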